We build a linear regression model that predicts property assessment values of single-family dwellings in Strathcona County, Alberta, Canada. The model is restricted to 2018 data and uses several property attributes, such as building size, age, and building type. Our quick-and-dirty regression performed reasonably well on out-of-sample data, with an R-squared of 0.78.
Property taxes are a major source of revenue for funding municipal operations. Mass market appraisal is the common approach municipalities take to determine the fair-market value of the properties they tax. However, few municipalities publicly release their property assessments, and fewer still release the property attributes tied to this mass market appraisal. Strathcona County is an exception: it releases annual property assessment information for all of its properties on its Open Data portal.
The data set used in our project consists of 2018 property assessments, restricted to single-family dwellings in Strathcona County, Alberta. Each row represents a distinct property within Strathcona County's borders, and each column represents either a potential explanatory variable or our dependent variable (property assessment value). Our explanatory variables are a mix of continuous, categorical, and binary data types: Building Size, Building Description, Age of Property, Year Built, Presence of Basement, Presence of Furnished Basement, Presence of Garage, Presence of Fireplace, and Longitude and Latitude. Altogether our dataset is composed of 28,450 observations.
To get an accurate assessment of our predictive model, we fit the model on 90% of our observations and reserve the remaining 10% as test data to assess out-of-sample fit.
Our data pipeline involved several steps. First, we grouped our features by data type. Each data type was then subject to a different transformation. For instance, our binary features were in the form of “Yes/No” and needed to be converted into 0’s and 1’s. We also had a categorical feature that takes on several dozen distinct values, which we converted to a set of binary indicator variables via one-hot encoding.
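The preprocessing and split described above can be sketched roughly as follows. This is a minimal illustration with pandas only, not the pipeline actually used in the project: the toy rows are invented, the "Yes"/"No" labels and building descriptions are assumed values, and the `random_state` is arbitrary.

```python
import pandas as pd

# Hypothetical toy rows reusing column names from the data set; all values are invented.
df = pd.DataFrame({
    "BLDG_FEET": [1200, 1850, 2400, 1600, 980, 2100, 1750, 1400, 3000, 1300],
    "BLDG_DESC": ["Bungalow", "2 Storey", "2 Storey", "Bi-Level", "Bungalow",
                  "2 Storey", "Bi-Level", "Bungalow", "2 Storey", "Bungalow"],
    "GARAGE": ["Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes"],
    "ASSESSMENT": [410000, 520000, 640000, 450000, 330000,
                   560000, 470000, 400000, 900000, 415000],
})

# Binary "Yes"/"No" feature -> a single 0/1 indicator column.
df["GARAGE"] = df["GARAGE"].map({"Yes": 1, "No": 0})

# Multi-level categorical feature -> one 0/1 indicator column per level.
df = pd.get_dummies(df, columns=["BLDG_DESC"], dtype=int)

# 90/10 split: sample 90% of the rows for training, keep the rest as test data.
train = df.sample(frac=0.9, random_state=0)
test = df.drop(train.index)

print(train.shape, test.shape)
```

Encoding before splitting keeps the indicator columns identical in both subsets; a category that appears only in the test rows would otherwise produce mismatched columns.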
Our dependent variable, property assessment value, exhibits right skew. Most property assessment values fall between $400,000 and $500,000. While almost no property is assessed at less than $300,000, a number of outlier properties exceed $1,000,000 and approach $2,000,000.
Figure 1. Assessment value frequency distribution
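As a numeric complement to the histogram, right skew can be checked by comparing the mean with the median and by computing the sample skewness. The values below are invented to mimic the shape described above; they are not taken from the Strathcona County data.

```python
import pandas as pd

# Invented assessment values: most near $400k-$500k, with a few large outliers.
assessment = pd.Series([
    380_000, 410_000, 430_000, 450_000, 460_000,
    480_000, 500_000, 620_000, 1_100_000, 1_900_000,
])

skewness = assessment.skew()        # sample skewness; a positive value indicates right skew
mean_value = assessment.mean()      # pulled upward by the large outliers
median_value = assessment.median()  # robust to the outliers

print(f"skewness={skewness:.2f}, mean={mean_value:,.0f}, median={median_value:,.0f}")
```

A mean well above the median, together with positive skewness, is consistent with a long right tail.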
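A quick look at our data found a small number of observations with missing values for particular features. The summary below shows that YEAR_BUILT, BLDG_DESC, and AGE each have 9 missing values out of 25,605 rows. These observations were dropped before modelling.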
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 25605 entries, 0 to 25604
## Data columns (total 12 columns):
##  #   Column      Non-Null Count  Dtype
## ---  ------      --------------  -----
##  0   YEAR_BUILT  25596 non-null  float64
##  1   ASSESSCLAS  25605 non-null  object
##  2   BLDG_DESC   25596 non-null  object
##  3   BLDG_FEET   25605 non-null  int64
##  4   GARAGE      25605 non-null  object
##  5   FIREPLACE   25605 non-null  object
##  6   BASEMENT    25605 non-null  object
##  7   BSMTDEVL    25605 non-null  object
##  8   ASSESSMENT  25605 non-null  int64
##  9   LATITUDE    25605 non-null  float64
##  10  LONGITUDE   25605 non-null  float64
##  11  AGE         25596 non-null  float64
## dtypes: float64(4), int64(2), object(6)
## memory usage: 2.3+ MB
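Counting and dropping rows with missing values, as described before the summary above, might look like the following sketch. The four rows are invented; only the column names come from the summary.

```python
import numpy as np
import pandas as pd

# Invented rows; the second one is missing YEAR_BUILT, BLDG_DESC, and AGE.
df = pd.DataFrame({
    "YEAR_BUILT": [1985.0, np.nan, 2004.0, 1999.0],
    "BLDG_DESC": ["Bungalow", None, "2 Storey", "Bi-Level"],
    "BLDG_FEET": [1200, 1500, 2400, 1600],
    "AGE": [33.0, np.nan, 14.0, 19.0],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Drop every row that has at least one missing value, then renumber the index.
cleaned = df.dropna().reset_index(drop=True)

print(missing_per_column)
print(len(df), "->", len(cleaned), "rows")
```

With so few incomplete rows relative to the full data set, dropping them is a simple choice; imputation would be an alternative if more rows were affected.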
The correlation heat map, restricted to our numeric features, shows that property assessment value is most strongly correlated with square footage (correlation of 0.79). The second most correlated feature is the age of the property (correlation of 0.36), which is also reflected in the year-built feature.
Figure 2. Housing features correlation heatmap
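The numbers behind a heat map like this one can be obtained from a correlation matrix over the numeric columns. The sketch below uses invented values and pandas' default Pearson coefficient; the source does not state which coefficient was used, so that choice is an assumption.

```python
import pandas as pd

# Invented rows; BLDG_DESC is included only to show that non-numeric columns are excluded.
df = pd.DataFrame({
    "BLDG_FEET": [1000, 1500, 2000, 2500, 3000],
    "AGE": [40, 10, 35, 5, 20],
    "ASSESSMENT": [350000, 430000, 520000, 600000, 700000],
    "BLDG_DESC": ["Bungalow", "2 Storey", "Bungalow", "2 Storey", "Bi-Level"],
})

# Pairwise Pearson correlations between the numeric columns only.
corr = df.select_dtypes(include="number").corr()

# Correlation of each feature with the dependent variable, ranked by absolute strength.
with_target = corr["ASSESSMENT"].drop("ASSESSMENT")
ranked = with_target.reindex(with_target.abs().sort_values(ascending=False).index)

print(ranked)
```

Ranking by absolute value keeps strongly negative correlations from being overlooked while preserving their sign in the output.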
A closer look at a scatter plot of property assessment values against square footage shows a clear positive association. However, it is apparent that this relationship becomes less tight as property assessment values grow larger.
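One way to quantify this loosening is to fit a straight line and compare the spread of the residuals for smaller and larger properties. The sketch below uses simulated data in which the noise is deliberately made to grow with building size. It illustrates the check itself, not the project's data or results, and it splits on size rather than on assessment value for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the noise standard deviation is proportional to building size.
sqft = rng.uniform(800, 4000, size=500)
noise = rng.normal(0, 1, size=500) * 50 * sqft
value = 100_000 + 200 * sqft + noise

# Ordinary least-squares line: value ~ intercept + slope * sqft.
slope, intercept = np.polyfit(sqft, value, deg=1)
residuals = value - (intercept + slope * sqft)

# Compare residual spread below and above the median building size.
small = sqft < np.median(sqft)
spread_small = residuals[small].std()
spread_large = residuals[~small].std()

print(f"residual std, smaller homes: {spread_small:,.0f}")
print(f"residual std, larger homes:  {spread_large:,.0f}")
```

A noticeably larger residual spread in the upper half would match the fanning-out pattern visible in the scatter plot.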